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Abstract 

Conditional sampling distributions (CSDs), sometimes referred to as copying models, under- 
lie numerous practical tools in population genomic analyses. Though an important application 
that has received much attention is the inference of population structure, the explicit exchange 
of migrants at specified rates has not hitherto been incorporated into the CSD in a principled 
framework. Recently, in the case of a single panmictic population, a sequentially Markov CSD 
has been developed as an accurate, efficient approximation to a principled CSD derived from 
the diffusion process dual to the coalescent with recombination. In this paper, the sequentially 
Markov CSD framework is extended to incorporate subdivided population structure, thus pro- 
viding an efficiently computable CSD that admits a genealogical interpretation related to the 
structured coalescent with migration and recombination. As a concrete application, it is demon- 
strated empirically that the CSD developed here can be employed to yield accurate estimation 
of a wide range of migration rates. D 

Keywords: structured coalescent, recombination, migration, conditional sampling distribution, 
hidden Markov model, sequentially Markov coalescent 



1 Introduction 



Under a given population genetic model, the conditional sampling distribution (CSD), also called a 
copying model by some authors, describes the probability that an additionally sampled haplotype 
is of a certain type, given that a collection of haplotypes has already been observed. As described 
below, various applications in population genomics make use of the CSD. Although the CSD is 
of much importance, no exact closed-form expressions are known in the situations to which it has 

been applied, and so a number o f approximations have been p ropo sed. 

Following the semin al work of Stephens and Donnelly (2000) and Fearnhead and Donnelly ( 200ll ) 
Li and Stephens ( 20031 ) proposed a widely used CSD, denoted 7r LS , which models the additionally 
observed haplotype as an imperfect mosaic of the haplotypes already observed. The model underly- 
ing 7r LS can be cast as a hidde n Markov model (HMM), thus admitting efficient implementation. In 
their paper, iLi and Stephens! used the CSD 7r LS in a pseudo-likelihood framework to estimate fine- 
scale recombination rates, and subsequently fr LS and its extensions have been used i n numerous other 



population gene tic applications, including estimating gene-conversion parameters (|Gav et all 12007 



Yin et ah . 20091 ). and phasing genotype sequence data into haplotype sequence data and imputing 
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Another important application of the CSD that has re c eived much attention is the inference 
of population structure and demography. Hellenthal et al. ( 20081 ) employed 7r LS to model human 
colonization history as a sequence of founder events and estimated the order of the founding events, 
as well as the relative contribution of different founding pop ulations during the events. To estimate 
the splitting time of two populations Davison et al. ( 20091 ) modified 7r LS to incorporate the spli t 
into the copying model, and used the same pseudo-likelihoo d framework as iLi and Stephens! l|200al ) 
to estimate the time of splitting. In a more recent study, Lawson et al. ( 20121 ) applied tttv to a 
sample of DNA sequences and used properties of the inferred mosaic pattern to reveal structure in 
the underlying population. 

To handle admixture, a modification to 7r LS was introduced by Price et al. ( 20091 ). who assumed 
that the previously observed haplotypes in the CSD are from two distinct ancestral populations 
(e.g., African and European). In modeling the mosaic pattern for a haplotype sampled from 
the admixed population (e.g., African American), it is then assumed more likely that adjacent 
segments orig inate from the same ancestral population, rather than from two different ancestral 
populations. 



Price et al. 



applied this modified copying model to detect chromosomal segments 
of distinct ancestry in admixed individuals and estimated admixtu re fractions in recently admixed 
populations. The same model was applied by lWegmann et al] (|201lh . who used the inferred ancestry 
switch-points to estimate relative recombination rates between different populations. 

As discussed above, 7r LS is a very useful CSD with a variety of applications, but it was not 
derived from, though was certai nly motivated by, prin c iples u nderlying the coalescent process. To 
derive CSDs in a principled way, iDe Iorio and Griffiths! ()2004ah introduced a general approximation 
technique based on the diffusion process dual to the coalescent; this work w as first presented in the 
case of a single locus and a panmictic population, but in a companion paper ( De Iorio and Griffiths! . 



200 4bf) the au t hors applied the method to the case of a subdivided population with migration. 



Griffiths et al.l (|2008l ) extended the diffusion approximation technique to handle re combination in 



the special case of two loci with parent-independent mutation at each locus, and iPaul and Song 
later generalized the framework to an arbitrary number of loci and an arbitrary finite-alleles 
ion model. 

Although more acc urate than the CSDs deve loped bvlFearnhead and Donnelly! (]200ll ) and by 



10) 



Li and Stephens] <j200ah . the CSD 7r PS proposed by lPaul and Songj (|2010l ) is not amenable to efficient 



evaluation. More precisely, 7r PS can be computed by solving a recursion that becomes intractable 
for a lar ge number of loci. However, utilizing ideas related to the sequentially Marko v coalescent 
(SMC) (iwiuf and Heinl . 1 19991 : iMcVean and Cardinl . 1200.4 iMarjoram and Walll . l200fih . which is a 
simplified genealog ical process that c aptures the essential features of the full coalescent model with 
recombination, we (jPaul et alll201ll ) recently developed an approximation to 7r PS that could be cast 
as an HMM with continuous hidden state space. Furthermore, upon discretizing this continuous 
state space, we obtained an accurate appro ximat ion with computationa l efficiency comparable to 
the CSDs of Fearnhead and Donnelly ( 2001 ) and Li and Stephens ( 20031 ) . 

In this paper, we extend our previous work on the seq uentially Marko v CSD to incorporate 
subdivided population structure with migration. Following IPaul and Son g] (|2010h . we describe a 
genealogical process for an additionally sampled haplotype conditioned on the genealogy of already 
observed haplotypes. We present a recursion t hat can be used to c ompute the probability of 
the additionally sampled haplotype, but, as in IPaul and Song solving this 



recursion is 
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tractable only for a small number of loci. As in IPaul et alJ l|201ll ). we apply the sequentially 
Markov framework to the conditional genealogical process with migration and recombination, and 
obtain an accurate approximation that facilitates computation for a large number of loci. As a 
concrete application, we demonstrate empirically that our new CSD can be employed in various 
pseudo-likelihoods to produce accurate estimation of a wide range of migration rates. 

The remainder of this paper is organized as follows: In Section [2j we introduce the notation 
adopted throughout the paper and describe the relevant population genetic model, the coalescent 
with recombination and migration. We then describe the genealogical interpretation of our CSD in 
Section [3] and introduce several approximations in Section U to obtain a CSD for which computa- 
tion is tractable. In Section [U we demonstrate the applicability of our CSD by employing it to the 
estimation of migration rates from simulated data. Finally, we conclude in Section [6] with a discus- 
sion of further applications and extensions of the CSD developed herein to estimate demographic 
parameters in more complex scenarios. 



2 Background 



In this section, we briefly describe how mi gration is integr a ted in to the coalescent with recombina- 
tion, and recall the CSD 7r PS proposed by IPaul and Song (120101 ). which we extend to incorporate 
migration in the following section. We begin by defining some general notation that will be used 
throughout. 



2.1 Notation 

We consider haplotypes in the finite-sites, finite-alleles setting. Denote the set of loci by L = 
{l,...,k} and the set of alleles at locus £ G L by Ej>; recombination may occur between any 
consecutive pair of loci, and we denote the set of potential recombination breakpoints by B = 
{(1, 2), . . . , {k — 1, k)}. The space of /c-locus haplotypes is denoted by H = E\ x • • • x E^. Given a 
haplotype h G %, we denote by h[£] G Eg the allele at locus i G L, and by h[£ : £'] (for £ < £') the 
partial haplotype [h[£], . . . , h[£']). 

We consider an island model of population structure with a finite set of demes denoted by 
r = {1, ...,</}. At a given time, each individual belongs to a single deme, and the ancestors 
and descendants of the individual may belong to different demes by means of a migration process, 
detailed in Section [2.21 

A structured sample configuration of haplotypes is specified by n = (n 7i / l ) grh6W , where n^^ 
denotes the number of haplotypes of type h within deme 7 in the sample, configuration of 
haplotypes within deme 7 G T is denoted n 7 , and the total number of haplotypes in the deme by 
|n 7 | = n 7 . The total number of haplotypes in n is denoted by n = |n| = X^er n i- Finally, we use 
e 7j / l to denote the singleton configuration comprising a single haplotype h within deme 7. 



2.2 The coalescent with recombination and migration 

The stochastic process underlying our work i s the coalescent with recombination and migration 
(| Griffiths and Marjoram! . 1 19971 : iHerbotl Il997h . Consider a structured population with a finite set 
r of demes. We denote the relative size of deme 7 G T by k 7 , where < k 7 < 1 and X^ 7 er K 7 = 1- 
Note that two individuals within a deme find a common ancestor at rate inversely proportional to 
the relative size of the deme; in the coalescent limit, coalescence within deme 7 occurs at rate nZ . 
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To allow for migration of ancestral lineages between denies, define v~~> to be the probability 
that an individual in deme 7 has a parent in deme 7'. In the coalescent limit, as the population 
size N tends to infinity, an ancestral lineage in deme 7 migrates, backwards in time, to deme 7' 
at rate m 7 y /2, where m 7 y = 4iVu 77 ' is the scaled migration rate. Henceforward, we consider a 
continuous-time Markov migration process with transition rate matrix M = (m 7 y/2) 7i7 / g r, where 
m 77 = — ^7^7 m 77'- For ease °f notation, we define m 7 = Yl-f^y m-yy' ■ 

An ancestral lineage undergoes mutation at locus i € L at rate 6i/2, where 0£ is the scaled 
mutation rate, and according to the stochastic mutation transition matrix P^. Further, as in the 
ordinary coalescent with recombination, an ancestral lineage undergoes recombination, backwards 
in time, at breakpoint b € B at rate pfc/2, where pb is the scaled recombination rate, giving rise to 
two lineages (in the same deme). 

A structured configuration n with n 7 individuals in each deme 7 can be sampled as follows. 
The process starts at present with rz 7 untyped lineages in each deme 7, and lineages in each deme 
7 evolve backwards in time subject to the following possible events: 

Mutation: Each lineage undergoes mutation at locus I € L with rate 9i/2 according to the 
mutation transition matrix P^. 

Recombination: Each lineage undergoes recombination at breakpoint b 6 B with rate pb/2. 

Migration: Each lineage migrates to deme 7' with rate m 7 y/2. 

Coalescence: Each pair of lineages coalesce with rate n~ . 

When a single lineage remains, the type at each locus I is chosen according to the stationary 
distribution of the mutation matrix P^, and this type is propagated toward the present, producing 
a realization for the sample n. 



2.3 The CSD 7r PS for a single panmictic population 

The approximate CSD 7r PS (jPaul and Song . l20ld ) for a single panmictic population is described by a 
genealogical process closely related to the coalescent with recombination. Suppose that, conditioned 
on having alrea dy observed a haploty pe configuration n, we wish to sample c additional haplotypes. 
As described in lPaul and Song ( 2O10l ). given the true fully-specified genealogy A n for the conditional 
configuration n, it is possible to sample a conditional genealogy C for the c additional haplotypes. 

The conditional genealogy C comprises the following: mutation, recombination, and coalescence 
within C, occurring at rates given in Section [2T2| and absorption of lineages into the known genealogy 
A n , occurring at rate 1 for each pair. Because the types of the lineages of A n are known, the type of 
an absorbed lineage is determined. Thus, when all lineages of C have been absorbed, the type may 
be propagated forward, thereby producing a realization for sample configuration c with |c| = c. 

Unfortunately, we do not gene rally have access to the true ge nealogy A n - Making use of the 
diffusi on-generator approximation (|De Iorio and GriffithiliooiJ bt lGriffiths et al.l . l2008l l. |Paul and Sond 
(|20ld > propose the following: Assume that A n = *4o( n )> where -4o( n ) is called the trunk geneal- 
ogy in which lineages do not mutate, recombine, or coalesce with one another, but instead form 
a non-random "trunk" extending infinitely into the past. Note that ^4o( n ) does not have a most 
recent common ancestor, and for this reason is improper; nonetheless, it remains possible to sample 
a well-defined conditional genealogy C, and thus to generate an additional sample c, in much the 
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same way as described above. In particular, lineages within C evolve backwards in time subject to 
the following events: 

Mutation: Each lineage undergoes mutation at locus I € L with rate 0£ according to P^. 
Recombination: Each lineage undergoes recombination at breakpoint b £ B with rate pb 
Coalescence: Each pair of lineages coalesce with rate 2. 

Absorption: Each lineage is absorbed into a lineage of A n = *4.o( n ) with rate 1. 

Observe that the rate of absorption is the same as in the case where A n is known. The rates for mu- 
tation, recombination, and coalescence, on the other hand, are each a factor of two larger than those 
given in Section \2. 21 intuitively, this adjustment accounts for using the (incorrect) trunk genealogy 
„4o ( n )> and notably the absence of events therein. Importantly, th e CSD 7r P c; has been shown to 
be correct for a one-locus model with parent indepe ndent mutation ( Stephens and Donnelly . 2000l : 
De Iorio and Griffiths! . l2004al : IPaul and SoneL l20ld ). a strong argument in favor of the given rate 
adjustment. The CSD tt ps is completely characterized by the above genealogical process. 

Remark. The r ates given for the genealog ical process governing tt ps are double those given by 
Paul and Song ( 2O10l ) and Paul et al. ( 2011 ). Importantly, the genealogical process is time-homogeneous, 
and so for the purposes of computing the conditional sampling probability (CSP) 7r PS (c|n), this 
modification has no effect (indeed, any constant multiple of the rates will yield the same CSP). 
However, we believe that the scaling adopted here admits a natural interpretation of the absorption 
time as a true coalescence time. For example, consider sampling a single haplotype conditional on 
a configuration n with |n| = 1; analogous to coalescence of two lines in Kingman's coalescent, 
absorption in the genealogical process associated with tt ps occurs at rate 1. 



3 A new CSD ir Mig for structured populations with recombination 
and migration 

We now introduce an approximate CSD n Mis by extending the genealogical process of Section 12.31 
to a general structured population with |T| > 1. Suppose that conditioned on having already 
observed a structured sample configuration n, we wish to sample c additional haplotypes with c 7 
of them in each deme 7. As before, given the true fully-specified genealogy A n for the conditional 
configuration n, including migration events, it is possible to sample a conditional genealogy C for 
the c additional haplotypes. The conditional genealogy C comprises the events and corresponding 
rates of Section 12.21 this time including migration, and the absorption of lineages in each deme 7 
into lineages of A n in deme 7. These latter absorption events occur at rate k" 1 . 

In practice, we do not have acces s t o the true genealogy . 4 n , but the diffusion- generator tech- 
nique (|De Iorio and Griffiths! . l2004al lbl: iGriffiths et all . I2OO8I : IPaul and Soml I2OI0I ) again implies 
the following approximation: Assume that A n = -4o( n ) = {-4o(n 7 )} 7e r, where „4o( n 7) is the 
non-random sub-trunk genealogy associated with deme 7, within which lineages do not mutate, 
recombine, migrate, or coalesce with one another. As in Section 12. 3} given this assumption it re- 
mains possible to sample a well-defined conditional genealogy C, and thus to generate the additional 
structured sample c. Specifically, lineages within each deme 7 of C evolve backwards in time subject 
to the following events: 
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Mutation: Each lineage undergoes mutation at locus £ G L with rate Oi according to the 
mutation transition matrix P^. 



Recombination: Each lineage undergoes recombination at breakpoint b £ B with rate pb- 



Migration: Each lineage migrates to deme 7' with rate m~y. 

Coalescence: Each pair of lineages coalesces with rate 2n~ 1 . 

Absorption: Each lineage is absorbed into a lineage of „4o( n 7) with rate k" 1 . 

Observe that the rates of mutation, recombination, migration, and coalescence are a factor of two 
larger than when the true genealogy A n is known. Intuitively, this again accounts for using the 
(incorrect) trunk genealogy .4o( n )> an d particularly the absence of events therein; see the remark at 
the end of Section [231 The approximate CSD 7r Mig is completely characterized by this genealogical 



process. See Figure 1(a) for an illustration. 

Remark. For strongly asymmetric migration rates, the approximate CSD 7r Mig , and in particular 
the assumed trunk genealogy Ao(n), may be very inaccurate. Consider for example the case of two 
denies and myi 3> txii\- The expected time for an additionally sampled haplotype in deme 2 to 
be absorbed into the trunk in deme 1 will be very large, since the lineages in the trunk genealogy 
„4o( n i) are confined to stay in deme 1. In case of the true genealogy A n , however, one would 
expect the lineages of the haplotypes in the observed configuration in deme 1 to cross over to deme 
2 quickly and coalesce more recently with the additional lineage. 

We now consider computing the CSP 7r M i g ( c l n )- It is possible to derive the following result di- 
rectly using the diffusion-generator approximation, but we defer this work to Appendix[A} Below, we 
obtain the result through the genealogical process detailed above; using typical forward-backward 
genealogical arguments in coalescent theory, we deduce that 7r M i g (c|n) satisfies the following equa- 
tion: 



7r M i g (c|n) 
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where Sf(h) denotes the haplotype obtained by substituting the allele at locus t of h with allele a, 
and lZb(h,h') denotes the haplotype obtained, via recombination about breakpoint b = (£,£+ 1), 
by joining the (partial) haplotypes h[l : £] and h'[£ + l,k]. The first term on the right hand 
side of this equation corresponds to coalescence and absorption of haplotype h in deme 7, and 
the subsequent terms correspond to mutation, recombination, and migration, respectively. The 
normalizing constant Af is given by 
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Equation (|3. 1 j) is for the "full" (conditional) genealogical process, and, because of the recombina- 
tion terms, it c annot be directly computed by solving a set of linear equations. However, as in 
Paul and Son3 j20ld ). it is possible to derive a "reduced" recursion related to ()3.1|) that can be 
computed by solving a finite set of linear equations. Unfortunately, the number of variables in 
the set of equations grows super-exponentially with both the number of loci and the number of 
haplotypes in the sample configuration n, making it computationally intractable for all but the 
smallest problems. In the following section, we propose accurate approximations that allow for 
efficient computation. 



4 An efficiently computable CSD as an approximation of fr- 



Mig 



As described above, the recursion for 7r Mig (c|n) becomes computationally intractable for even mod- 
est datasets. In what follows, we adopt a set of approximations to obtain a CSD that admits 
efficient implementation, while retaining the accuracy of 7r M i g - 



4.1 The CSD 

^Mi g sMc : Sequentially Markov approximation of C 

We follow Paul et al. J 201 ll) and use ideas underlying the SMC ( Wiuf and Hein , 1999 ; McVean and Cardin 



20051 : iMarioram and WailreOOd ) to approximate 7r M i g . Briefly, observe that a given conditional ge- 
nealogy induces a marginal conditional genealogy (MCG) at each locus, where each MCG comprises 
a series of mutation and migration events, and the eventual absorption into a lineage of the sub- 



trunk in a certain dem e. See Figure 1(b) for an illustration. The key insight, initially provided by 
Wiuf and Heh] (Il999h . is that we can generate the conditional genealogy as a sequence of MCGs, 
rather than backwards in time. Moreover, though the sequence is not formally Markoy , it is well ap- 
proximated (jMcVean and Cardin . 1200.4 Marjoram and Wall 120061 : IPaul et all 1201 ll ) by a Markov 
process using a two-locus transition density. Applying this approximation to 7r Mig yields the sequen- 
tially Markov CSD 7r M i g sMc- For ease of exposition, we restrict attention to the case of sampling 
a single additional haplotype, denoted f), but the ideas generalize, in principle, to sampling two or 
more additional haplotypes. 

Since mutations can be superimposed onto the conditional genealogy, we first consider gener- 
ating a sequence of MCGs without mutations according to a Markov process. The genealogical 
process underlying 7r Mig yields the following sampling procedure for the MCG at an arbitrary locus: 
The ancestral lineage of the additionally sampled haplotype initially resides in deme a, where the 
additional haplotype is sampled, and proceeds backwards in time, subject to the migration pro- 
cess, until being absorbed into a lineage of the sub-trunk ^4o( n 7) within the current deme 7. The 
associated marginal distribution is used as the initial distribution at the first locus. 

Conditional on the marginal genealogy at locus I — 1, the marginal genealogy at locus I can be 
sampled by first placing recombination events onto the MCG at locus I — 1 according to a Poisson 
process with rate P(£-i : £)- If no recombination occurs, the marginal genealogy at locus I is identical 
to the one at locus t — 1. If recombination does occur, the MCG at locus I is identical to the 
MCG at locus £ — 1 up to the time tb of the most recent recombination event. At this point, the 
lineage resides in the same deme in which the ancestral lineage at locus £ — 1 resided at the time of 
the recombination event, and, independently of the lineage at locus I — 1, proceeds backwards in 
time, subject to the migration process, until being absorbed into a lineage of the sub-trunk „4o( n 7) 
within the current deme 7. Figure [2] illustrates this transition mechanism for the Markov process. 
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Figure 1: Illustration of the subsequent ap- 
proximations to the true conditional sam- 
pling distribution. The three loci of each 
haplotype are each represented by a filled cir- 
cle, with the color indicating the allelic type 
at that locus. The trunk genealogies in deme 
1 .4.0 ( n i) and deme 2 „4o( n 2), as well as the 
conditional genealogy C are indicated. The 
different denies are indicated by the white 
and the grey background. Time is repre- 
sented vertically, with the present (time 0) at 



the bottom of the illustration, (a) The ge- 
nealogical interpretation: Mutation events, 
along with the locus and resulting haplo- 
type, are indicated by small arrows. Recom- 
bination events, and the resulting haplotype, 
are indicated by branching events in C. Mi- 
gration events are indicated by switching to 
another deme. Absorption events, and the 
corresponding absorption time (t^ and t^) 
and haplotype (h^ and h^ b \ respectively), 
are indicated by dot-dashed horizontal lines. 
|(b) | The corresponding sequential interpreta- 
tion: The marginal genealogies at the first, 
second, and third locus are emphasized as 
dotted, dashed, and solid lines, respectively. 
Mutation events at each locus, along with 
resulting allele, are indicated by small ar- 
rows. Absorption events at each locus are 
indicated by horizontal lines, (c) The cor- 



responding sequential interpretation where 
just the deme of absorption, the time of ab- 
sorption, and the absorbing haplotype are 
recorded. The gap in the ancestral lineages 
indicates that the marginal conditional ge- 
nealogy is integrated out. 
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Fi gure 2: The transition density from locus £ — 1 to locus £ in the model underlying 7TMigSMc 
is illustrated. The white and the grey background symbolize the two different demes that the 
ancestral lineage can reside in. (1) A Poisson number of recombination events is placed uniformly 
onto the marginal conditional genealogy at locus £ — 1. (2) If the time % of the most recent 
recombination event is more recent than the time of absorption t^_i, then the marginal conditional 
genealogy up to this time is copied to locus I. (3) The ancestral lineage at locus £ evolves according 
to migration until it is absorbed at time te into the trunk in some deme. 



Conditional on the MCG at locus £, mutations are superimposed onto the MCG according to 
a Poisson process with rate 0£. The MCG is absorbed into a trunk lineage corresponding to some 
haplotype h, thereby specifying an "ancestral" allele h[£]. This allele is then propagated to the 
present according to the mutations and the mutation transition matrix pW, thereby generating an 
allele at locus £ of the additional haplotype. We refer to the associated distribution of alleles as 
the emission distribution. 

It is possible to write down explicit expressions for the initial, transition, and emission distribu- 
tions for 7r M igSMc- However, as the state space for the MCG at each locus includes the entire migra- 
tional history, an efficient algorithm for computing the CSP is not known. In the next subsections, 
we introduce further approximations to this model in order to admit an efficient implementation. 

Although we do not prove it here, we note that, analogous to lPaul et al.l (120111 ). the sequentially 
Markov version of the CSD can be obtained from the genealogical process introduced in Section [3] 
by prohibiting coalescence events in the conditional genealogy between lineages not ancestral to any 
overlapping parts of the additionally sampled haplotypes. In the case of sampling one additional 
haplotype, this corresponds to prohibiting all coalescence events in the conditional genealogy. This 
observation allows one to write down a recursive formula to compute probabilities under 7r M i g sMc> 
but this again does not lead to an efficient implementation. 



4.2 The CSD 

7TMigSMC Ao : Keeping track of the absorption time only 

As noted in the previous subsection, if we keep track of all demes in which the additional ancestral 
lineage at a given locus resides at any given time in the past, then the MCG is a complicated 
object. To remedy this, we approximate the full marginal genealogy by just recording the time 
until absorption, as well as the deme in which the ancestral lineage resides at the time of absorption 
and also the absorbing haplotype. The reduced MCG at locus £ is thus given by a triplet of random 
variables (Gf,T e A ,Hf) € r x R> x H, that is the deme of absorption Gf, the absorption time 
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T A , and the absorbing haplotype Hf. Henceforward, we use S to denote T x ]R>o x %. 

Now, observe that the marginal migration dynamics of the ancestral lineage at a single locus can 
be described by a continuous-time Markov chain with a finite state space. The states can be divided 
in two groups: one state for each deme denoting residence in that deme before being absorbed into 
the trunk, and another one for each deme to represent being absorbed into a lineage of the trunk in 
the given deme at some previous time. We denote the set of states by {1, ... ,g, a\, ... , a g }, where, 
for 1 < i < g, state i denotes residence in deme i, and state a, denotes absorption in deme i. The 
dynamics of the Markov chain is given by the (2g x 2g)-dimensional block-specified rate matrix 



Z 



M-A A 




where is a {g x g , )-dimensional matrix of zeros, M is the (g x ^-dimensional matrix of migration 
rates which govern the transitions between the first group of states (the non-absorbed states), and 
A is the (<7 x g)-dimensional diagonal matrix 



A 



Ik^ux ■■■ \ 
\ ••• Hg l n g ) 



which governs the transition into the second group (the absorbed states). The diagonal form ensures 
that the absorbed state aj can be reached only if the ancestral lineage currently resides in deme i. 
Also, note that absorption is proportional to the inverse of relative size k~ 1 of deme i, as well as 
the number of trunk- lineages nj in deme i. Because the absorbing states are also absorbing in the 
context of the Markov chain, the rows of Z corresponding to these states are set to zero. 

The process generating the conditional genealogy for the whole additional haplotype proceeds 
sequentially along the haplotype, and thus admits a natural interpretation in an HMM framework, 
where the MCG at a given locus is the hidden state and the allele of the additionally sampled 
haplotype at this locus is the emitted symbol. We now describe the initial density, the transition 
density, and the emission probability. 



4.2.1 Initial density 

Standard theory of Markov chains yields that the probabilities of interest for the initial density 
can be found in the respective entries of the transition semigroup. If the additional haplotype is 
sampled at present in deme a, the probability of residing in deme i and not being absorbed more 
recently than time t into the past is (e tz ) a ,i. On the other hand, the cumulative probability of 
being absorbed in deme i more recently than time t is given by {e tz ) a ^ ai . Thus, the initial density of 
state s = (u, h, t), that is the density of being absorbed in deme u at time t into the trunk-lineage 
of haplotype h, is given by the derivative of the latter matrix exponential: 

f (») ( s ) = ^-P{G A = to, T A < t , H A = h) 
dt 

dt \ riaj v !a ^) 
= r h±L( Ze tz ) . 
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The factor n^^/n^ comes from the fact that absorption into a specific lineage of the trunk is 
uniform amongst those present in deme u. 



4.2.2 Transition density 

The density for transition from locus I — 1 to £ using the full MCG, described in Section 14.11 
conditions on the full migration history of the lineage of the additionally sampled haplotype at 
locus £ — 1. Thus, at the time of a possible recombination event, all demes up to this event, 
including the deme where the event takes place are determined. If only the time and the deme of 
absorption are recorded, then the deme in which the ancestral lineage resides at the time of the 
recombination event is a random variable with a distribution determined by the dynamics of the 
Markov chain. Let G^ b denote the random deme in which the ancestral lineage of the additional 
haplotype resides at the time tf, of the recombination event. Then, for 7 € T, the distribution is 
given by 

\e tbZ ] IZe^- 1 ^ 2 ] 

m Sr B — rv I -+ r A -,, I- ' 7 h,*u t -i , , 

n^t b ~ 7 I - tt-x,G t _i - \ze^-i z ] ' ^ ' 

L J Q^Owa j 

The transition from sg-\ to sg is now given as follows: The time ij, of the potential recombination 
event is chosen according to an exponential distribution with rate prg—i t gy If i& > if— 1, no recom- 
bination occurs and the MCG sg is identical to sg_\. If i& < if-i, then recombination occurs and 
we use (I4.2p to determine the probability that the ancestral lineage resided in a certain deme at 
the time of the recombination event. The ancestral lineage at locus £ proceeds from time i& in this 
deme and is again subject to migration according to the dynamics of the Markov chain governed 
by rate matrix Z, until it is absorbed. Integrating over the possible times of the recombination 
event and summing over the different possible demes yields the following transition density of the 
hidden state sg at locus £, given sg_\ at the previous locus: 

^fak-i) = HGf = ut,lf € dt e ,Hf = h,\Gti = ut-i,T e A _i = te-uHf^ = h^} 



pi 



p -pbtl-lX 



[e tbZ ] a [Ze^-i-^ 2 ] 



71,... /, ^ Zpte-iZ] L iy,a u , °' 



(4.3) 



where ti-\ A tg denotes the minimum of tg~\ and tg . 



4.2.3 Emission probability 

Since the mutation rate does not depend on the deme in which the ancestral lineage of the additional 
haplotype resides, the emission probability at locus £ only depends on the absorption time tg and the 
allele of the absorbing haplotype at that locus hg[£]. As described above, a Poisson number (with 
mean tgOg) of mutation events is placed onto the MCG sg and the "ancestral" allele is propagated 
to the present according to the mutation transition matrix P^. Thus, the probability that the 
allele of the additional haplotype is t}[£], given the hidden state sg, can be written as the following 
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matrix exponential: 

I s t ) = F{Sj £ = t}[£] I Gf = u lt Tf = t t , Hf = h e } 



(4.4) 

where Sj£ denotes the random allele emitted at locus t. 



j-gt^/CpW-i)-] 



4.2.4 A hidden Markov model formulation to compute the CSP 

Using the quantities introduced in the previous Secti ons, we can now em ploy the forward algorithm 
for this HMM with continuous hidden state space (|Cappe et all l2005h by defining the quantity 
fj>(-) recursively. The base case for the first locus is 

and the recursive step for the transition from locus I — 1 to I is 



where b = (£ — 1, £). Finally, the CSP vr M i g sMc-Ao(fl| n ) of an additional haplotype f) given the already 
observed sample configuration n is given by: 

7TMigSMc-Ao(f)|n) = J fl{s k )ds k . 

The sequentially Markov genealogical process corresponding to the CSD 7r MigSM c-Ao is illustrated in 



Figure 1(c) 



Note that the dynamics of the Markov chain on the hidden states is reversible with respect to 
the initial density, i.e., 

<j>f\s'\s)^ a \s) = <j>^\s\s')C,^\s') 

holds for all recombination rates p € M>o and hidden states s,s' € S. Thus, the initial density £^ (•) 
is in fact the stationary distribution of the Markov chain on the hidden state space. Reversibility 
also ensures that the CSP computed starting at the first locus and proceeding forward is the same 
as the CSP computed when starting at the final locus and proceeding backward. 

4.3 Discretizing time 

The reduced hidden state space of the HMM introduced in the previous subsection yields an ap- 
proximation to the full sequentially Markov conditional sampling distribution. However, the hidden 
state space (in particular the absorption time) is continu ous, making impl ementation with stan- 



dard (discrete) HMM methodology impossible. Thus, as in lPaul et al.l (|201ll ). we propose a further 
approximation, by discretizing the positive real line into a finite number of intervals and recording 
the interval that the absorption time falls into. Formally, this corresponds to the approximation 
that the transition density and emission probability are equal for times that belong to the same 
interval. 

To this end, assume that = xq < x\ < ■ ■ ■ < = oo is a finite, strictly increasing sequence in 
M>oU{oo} such that T> = {Dj = [xj—i } &j)}j=i d 1S a partition of R>o into d intervals. We denote 
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the discretized hidden state space T x {1, . . . , d} xH by T, and the hidden states by a = (u),i,h) € £, 
where i is the index of the interval of absorption. As before, oj denotes the deme and h the trunk 
lineage of absorption. Based on the partition T>, denote the discretized version of the initial density 
as 

£( n ) ( a ) = F{G A = u, T A € A , H A = h} 
(( a \u),t,h)dt 



u(u, i) 



Tit, 



where 



u(cj,i)= [ (Ze tz ) dt = ie^ z ) -(e x ^ z ) . 



Note that the event {T A 6 D{} encodes that we only record the time interval in which absorption 
happens. 

Similarly, we can derive the discretized version of the transition density as 
(o*|<7*_i) = ¥{Gf = uj £ ,T a e D it ,Hf = ht\Gt_x = G ^n-x^tx = h-i} 

<j$ ; ^ ; | W£ _ 1 ; t t _ i , ^ _ i ) C (n) (ue- 1 , ^_ i , he- 1 ) i dt t 



Pb 



1 



C< n >fa_i) Jd h Jd^ 



y Pb {ut-i,k-i) -dst-use + Zpb&t, it\ut-i, k-i) ■ nu)tM 



(4.5) 



where explicit expressions of y Pb (u}£-i,u-i) and z Pb (u)£,ii\u)£-i,i£-x) are shown in Appendix [Cj 
Note that we again only record the intervals containing the absorption times at locus I — 1 and I. 
Finally, the emission probabilities in the discretized HMM can be obtained via 



ie?m\°t) = n^e = m\Gf = ^,r/ G d h ,h a = h e } 



Di 



and we again provide a more explicit form of this quantity in Appendix O Note that the emission 
probability (|4.4|) in the continuous case is only dependent on the time of absorption and the allele 
that the absorbing haplotype bears at the given locus. The discretized analog (|4.6p on the other 
hand also depends on the deme that the absorbing haplotype resides in. This is due to the fact that 
the latter conditions on being absorbed at any point in a given time interval, and since the rate of 
absorption during that interval depends on the deme, this dependence enters expression (|4.6p . 

With th e state space discre tized, the CSP can be computed via the standard forward algorithm 
for HMMs (jCappe et allbOQfil ). Thus, we define the quantity F^(ai) recursively along loci. At the 



first locus, we have 

The transition from locus I — 1 to locus I is given by 

a e _ 1 £'E 
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and the probability of observing haplotype f) under the discretized HMM is given by 



^MigSMC-AOD (0|n) := £ F*(a*), 
cr fe GS 

which provides an approximation to 7r M i g sMc-Ao(f)l n )- 
Remark. 

1. In IPaul et aD (|201lh . the authors advocate using a discretization based on points obtained 



from Gaussian quadrature. However, we obtained numerically more stable results when using 
a logarithmic discretization, that is Xj = — (1/r) log((d — where r is the harmonic mean 
of the absorption rates in each deme. 

The runtime of the standard implementation of the forward algorithm fo r HMMs describe d 
in the previous paragraph is quadratic in the number of hidden states. In Paul et al. ( 201ll ). 



the authors describe a straightforward implementation of their model that leads to a better 
bound on the runtime. Since our transition density is of similar form, a similar improvement 
can be applied here. 



5 Application: Estimating migration rates 



To demonstrate the utility of our approximate CSD 7r M i g sMc-AOD , we considered estimating migration 
rates for data simulated under the full coalescent with recombination and migration. In particular, 
we simulated data for k = 10 4 bi-allelic loci. For simplicity, we set 0£ = 5 x 10~ 2 and = 



1/2 1/2 
1/2 1/2 



for all £ G L, and pt = 5 x 10 2 for all b G B. We considered a structured population 



m2i — m ■ For each value of 
10 individuals in each of the 



with two demes (i.e., T = {1,2}), and set k\ = ki = 0.5 and mu 
m G {0.01,0.10, 1.00, 10.0}, we generated 100 datasets with n\ = ni 
two demes. 

Observe that the per-individual mutation and recombination rates are thus both approximately 
10 4 • 5 x 10 -2 = 5 x 10 2 . In humans, for which average per-base mutation and recombination 
rates are on the order of 10 -3 , these values correspond to a genomic sequence on the order of 
500 kb. We thus reason that the haplotypes we simulated are representative of a relatively longer 
genomic sequence that has been "compressed" , for reasons of computational efficiency, into 10 4 loci. 
Further, we chose t he range of migrat i on ra tes to be compliant wit h recent estimates in humans 
( Gutenkunst et al. . 2009 ; Gravel et all 2011 ). as well as Drosophila ( Wang and Hev . 2O10l ). 

We considered three approximate/composite likelihood formulations that make use of the CSD. 
Let n be a particular sample configuration of n haplotypes, and write n = Y17=i e 7i,^r 

LCL: In the Leave-one-out Composite Likelihood, the likelihood is approximated as a product of 
CSPs with each the result of removing a single haplotype from the sample configuration: 



LCL(n) 



n 



7T 



MigSMC 



1/n 



PAC: In the popular Product of Approximate Conditionals framework ( Li and Stephens . 20031 ) . the 
likelihood is approximated by a conditional decomposition, averaged over 2 random permuta - 
tions {o~j} of the haplotypes (this number of permutations is as suggested bv lLi and Stephens! ). 
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Figure 3: Re-scaled log likelihood surfaces for two sample configurations (generated for m = 0.10, indicated 
by a vertical line in the plots), and for each of the three approximate likelihood formulations (LCL, PAC, 
PCL) described in the text. In both cases, the likelihoods are computed using the true values of 9 = 5 x 10~ 2 



and p = 5 x 10 . (a) A case for which all of the likelihood surfaces are similar |(b)| A case for which the 



LCL likelihood surface is substantially different than the likelihood surfaces for PAC and PCL 
Defining ^(e^J = e^,^ : 

- 20 n i ^ 

PAC(n) = — ^Jj7r M ig S Mc-AOD(^(e 7ii/li )|n- ^ M^'A'), 



j=l i=l i'=l 



PCL: In the Pairwise Composite Likelihood, the likelihood is approximated as a product of CSPs 
with each a single haplotype conditioned upon a single haplotype: 

n 

PCL(n) = Y\ Y\ ^MigSMc-AOD(e 7i ,/ li |e 7 ., ! / l .,). 

i=l i'^zi 



We set the values of 9 and p to the (true) values used for simulation, and considered the approximate 
likelihood surfaces for the parameter m. Figure [3] shows the surfaces for two example configurations 
(generated as described above) for m = 0.10. Perhaps most importantly, the likelihood surfaces 
appear to be unimodal and otherwise well-behaved. In Figure 3(a) , the likelihood curves are quite 
similar to one another, and the maximum likelihood occurs near the true parameter. This is not 
generally true, however, as evidenced by Figure [3(b)| for which the likelihood surface for the LCL 
method is substantially different than that of PAC and PCL. 

We also considered the behavior of the maximum likelihood estimate (MLE) under each of the 
likelihood approximations. For each simulated dataset, we computed, using golden section search, 
the MLE migration rate m, and computed log 2 (m/m), where m is the (true) migration rate used 
to generate the dataset. In this way, results for different values of m are directly comparable; a 
correct estimate of the migration rate produces a value of 0, and under- and overestimation by a 
factor of two produce values of —1 and 1, respectively. To assess the performance of the MLEs 
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based on the CSD developed in this paper, we also compared with estimates obtained from the 
widely- used test statistic Fst : 



Fst 



It can be shown that the migration rate in a symmetric island model with two sub-populations 
can be estimated by 

m(n) = — | — — ; — : 



4 VFsT(n) 

where Fst(h) = 1— 7rs(n)/7TT(n), wi th 7rs(n) denot i ng th e average within-pop ulation diversity 
and 7TT(n) the overall diversity; c.f.. ICharlesworth (|l998l . Equation (4)). Note that, although 



Charlesworthl discusses three different estimators for Fst, the corresponding migration rate 



estimators coincide in models where the sub-populations have equal weights. 

For each true migration rate m £ {0.01,0.10,1.00,10.0}, box plots for the transformed MLE 
under each likelihood approximation and the Fgx-based estimator are presented in Figure HI Ob- 
serve that the LCL-based MLE performs very poorly for m = 0.01 (see Figure 4(a) ), consistently 



underestimating the true value; this may be because the final haplotype to be sampled is generally 
very similar to previously-sampled haplotypes within the deme, obviating the need for migration 
events within the conditional genealogy. Intuitively, this effect should be diminished when the data 
are produced using larger migration rates, which does appears to be the case (see Figures |4(b)| 
4(c)] and [4(d)] ). 

On the other hand, the PCL-based MLE performed poorly for m = 10.0, again consistently 
underestimating the true value. This may be because, for large migration rates, there simply is not 
enough information in a pairwise analysis of the haplotypes to determine the true rate; intuitively, 
this effect should be diminished when the data are produced using smaller migration rates, relative 
to the rate of recombination. This is indeed the case, and in fact, for smaller migration rates, the 
PCL-based MLE is well-correlated with the PAC-based MLE (data not shown). 

The PAC-based MLE appears not to suffer at either of these extremes. We speculate that this 
is because PAC incorporates both pairwise and higher-order terms, making it less susceptible to the 
problems we observe with the LCL- and PCL-based MLEs. We note that Li and Stephens! ( 2003 ) 
came to a similar conclusion. 

For low migration rates, the method based on Fst consistently overestimates the true rates (see 



Figure 4(a) ), but shows a small variance. For intermediate migration rates, on the other hand, it 
produces underestimates (see Figures |4(b)| and 4(c)), and the variance is larger than that of the 
PAC-based MLE. The estimates for large migration rates (see Figure |4(d)[ ) are similarly biased, 
although the variance is comparable to the MLE methods in this case. Overall, the PAC-based 
estimation is quite accurate, demonstrating that, using the CSD 7r M i g sMc-AODi it is possible to attain 
accurate estimates of the migration rate. 



6 Discussion 

Numerous applications in population genomics make use of the conditional sampling distribution, 
so developing accurate, efficiently computable CSDs for various population genetic models is of 
much interest. Recently, we proposed an accurate sequentially Markov CSD that follows from 
approximating the diffusion process dual to the coalescent with recombination for a single panmictic 
population. In this paper, we have extended that approach to incorporate subdivided population 
structure with migration, providing a novel CSD that facilitates computation and also admits a 
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Figure 4: Box plots (produced using the software package R, and including outliers) for the quantity 
log 2 (m/m) over 100 samples, where rh is the MLE under each of the three approximate likelihood formula- 
tions (LCL, PAC, PCL) or the FsT-based estimate as described in the text. The MLE values rh were found 
using golden section search within the interval (m • 10 _1 , m ■ 10) (a) m = 0.01 |(b)| m = 0.10 (c) m = 1.00 1 (d) | 
m = 10.0. Note that the median of the LCL estimator in (a) lies on the lower bound of the interval, thus at 



least half of the estimates reach this bound and arc most likely even smaller. 



useful genealogical interpretation closely related to the structured coalescent with migration and 
recombination. We believe that this extension will have several interesting applications, some of 
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which we list below. 

Recalling the applications of CSDs described in the introduction, we note that it is straightfor- 



ward t o apply 7r M i g sMc-AOD to annotate segments of distinct ancestry in individuals. As in lPrice et al 



(2009), ;he already observed configuration consists of the donor individuals from different popu- 
lations. For low migration rates, the model underlying vr M igSMc-AOD leads naturally to the fact 
that, following a recombination event, the ancestral lineage at the next locus is more likely to get 
absorbed in the same deme, rather than switching demes by a m igration event and then getting 
absorbed in a different deme. Whereas the method developed by iPrice et al. is applicable 

for recently admixed individuals, we expect our model to be more accurate in situations where the 
mixing of the p opulations happened oy er a long time through the continuous exchange of migrants. 

Recall that IWegmann et al.l (120 111 ) estimated relative recombination rate variation in different 
populations based on ancestry switch-points in the chromosome detected using the method of 
Price et al.1 fcoOfll ). As detailed above, our model can be extended to detect ancestry switch-points 



in populations that mixed over long periods of time. In such situations we expect t hat the segments 



of diff erent ancestry detected by our method can be used in a similar fashion as in lWegmann et al 



(|201lh to analyze recombination rate variation in different subpopulations, when no strong recent 

admixture is evident. 

Recently, iLi and Durbinl (|201ll ) performed a related analysis of human demography. They used 
the SMC in the special case of a sample consisting of only two sequences, and thus were able t o 
obtain explicit transition functions along the sequence, as we did for our CSD jPaul et trillion] ). 



Li and Durbinl incorporated changes in the size of the population into their model, thus allowing 



them to use the two sequences of a diploid individual to infer population size histories of different 
human populations. They do not explicitly account for population structure and migration in their 
analysis, but we believe that the me thods developed in t his paper could be readily incorporated 
into their model. In a similar study, Mailund et al. ( 201ll ) used a pair of sequences sampled from 
different populations in the SMC framework to estimate ancestral population sizes and splitting 
times in an isolation model. Again, we think it is possible to incorporate migration into the model 
usi ng the ideas we develo ped in this paper. 

Paul and Sond faoij ) have recently developed a framework to substantially increase the speed of 



computations involved in dealing with HMMs for next- generation sequen ce data, and they demon- 
strate their improvements in the model introduced by iPaul et al.l (|201ll ). Utilizing the fact that 
whole genomic sequence data consists of long stretches without sequence variation in between SNPs, 
and that the observed variation can be described by a small number of haplotype blocks, they were 
able to decrease the computation time by several orders of magnitude. The same ideas can be ap- 
plied to speedup our method, fostering the application of analyses like the one detailed in Section [5] 
or similar applications to whole genome sequence data. 

The CSDs developed in this paper assume that the structure underlying the population remains 
unchanged throughout the whole course of evolution. Furthermore the rates at which migrants are 
exchanged are assumed constant. This is mirrored by the fact that the Markov chain described 
by the matrix Z is time homogeneous and the number of states does not change. For more re- 
alistic populations one would pose changes of the population structure and the population sizes 
at different points in the past, as well as varying rates of migration. The methods developed in 
this paper can be readily extended to scenarios where the structural parameters of the underlying 
demography are piecewise constant for given periods of time. This can be implemented by allowing 
the Markov chain, governing the absorption of the additional ancestral lineage, to be piecewise ho- 
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mogeneous. Except for the work of Davison et al. ( 20091 ) and Price et al. ( 20091 ). we are not aware 
of any other CSDs that try to incorporate explicit population structure into the copying model. 
Such a CSD accounting for a more general demographic model would allow one to estimate more 
general demographic parameters like ancient population sizes and structure, as well as migration 
rates, and duration of periods of migration in certain isolation-with-mi gration scenarios, using , 
for example, the framework illustrate d in Section El importa nce sampling ([Stephens and Donnelly! . 
200C; Fearnhead and Donnelly . 2001 ; Griffiths et al. . 20081 ) . or othe r frameworks detailed in the 
introduction. Mvers et al. ( 20081 ) show that demographic studies like Gutenkunst et al. ( 20091 ) and 



Gravel et al.1 (|201lh . that rely exclusively on the frequency spectrum, can be limited in resolving de- 



mographic parameters, and methods, as the one developed in this paper, that explicitly incorporate 
linkage structure, might alleviate such problems. 
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A Diffusion approximation 



We here provide a derivation o f t he sampling recursion using the diffus i on gen erator approximation 
(foe Iorio and Griffiths! . l2004al lbl: iGriffiths et all . 120081 : IPaul and SoneL l20ld ). The diffusion asso- 
ciated with the coalescent including migration and recombination has state given by the vector 
x = (x 7j ^)^ er h ^ where x 7j /i is the frequency of haplotype h within deme 7. The generator for 
the diffusion can then be written 



JSf/(x) = J2 C ^ 



a 



7er 
hen 



Ox 



7, h 



■/(x), 



where 



^ h'eH Xl ' h ' 



+ J2 d * X! x 7,5 < »(ft)( P 
leL aeEi 



+ X^ b ( XT x -y,K b (h,h') x ~(,K b (h',h) -^7^)/( x 



beB 



+ ( m 77 ,2 V,/i - m 7 x 7,^)/( x ) p 



7V7 

and / is an arbitrary, bounded, twice-differentiable function with continuous second derivatives. 

Denote the probability of obtaining (at stationarity) an ordered sample configuration n by q(n). 
Then q(n) = E[q(n|X)], where E denotes expectation with respect to the stationary distribution of 
the diffusion, X denotes the random vector of frequencies, and q(n\x) = I^er II/igw x ^k ■ Finally 
given an additional haplotype configuration c, the true conditional sampling probability is, by 
definition, 7r(c|n) = q(c + n)/g(n). 

By general diffusion theory, E[j£f/(X)] = 0. The diffusion generator approximation assumes the 
existence of a distribution, with associated expectation E, such that the previous condition holds 
component-wise; that is for each 7 £ T and h £ H, 



E 







dX, 



■y,h 



-/(x) 



0. 



By analogy to the exact case, we assume that the sampling distribution is approximated by q(n) = 
E[g(n|X)], and define the approximate CSD 7r Mig (c|n) = q(n + c)/q(n). Using the component- wise 
approximation above, 



EE 

7er hen 



Cy.h 



-E 







■"y,h 



dX. 



7 J 1 



-g(n + c|X) 



0. 



Using the expressions for Cy t h an d q(n + c|X), together with the definition q(n) = E[g(n|X)], we 
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obtain, after some algebra analogous to IPaul and Sonel ||20icl . Appendix) 



C-f,h 



■yer 
hen 



zZ°^H P ii.M^ n + c " e i,h + ^,s«(h)) 

£gL a£E e 

^Pb^2 4(n + c- e Jjh + e l! - R . b {h,h') + e ^,n b (h',h))) 



m w q(n + c 

7Y7 



where the normalizing constant is given by 



M = Z ^ [( n i + c 7 - 1)« 7 1 + Y 9e + Y P b + 
7er 



m-. 



l£L b&B 



Dividing this result by q(n), we thus obtain the given recursion for 7r Mig (c|n). It is also possible to 
derive , in much the same way, the "reduced" version of this recursion; for details, see lPaul and Song 
(|20ld ). 



B Explicit transition density 

We begin by assuming that the matrix Z is diagonalizable, which is true if and only if M is 
diagonalizable. In this case, the matrix exponentials in equations (14. ip and (14. 3|) admit the eigen- 
decomposition {e tz ) iy j = J2k=i etXk ( v k w k) and {Ze tz ) iy j = J2k=i ^e tXk {v k w k ). Here X k are the 
eigenvalues of Z, v k are the eigenvectors, and w k are the rows of the inverse of the matrix of 
eigenvectors V = (t>i, . . . , V2g)- This eigen-decomposition can be used to evaluate the matrix 
exponential in equation (14.11) . and to compute the integral in equation (14.31) analytically as 

</>% ) (8 t \ ae - 1 ) = e-"> t '-i6 Bt , at _ 1 

n 29 2g 29 

+ n T Ht (7,h fz\ EZ) zZ Yl [(VkWkU^W^a^iVnWn)^ 

n ^ l Ze ) oca^ jeTk=lm=ln=l 

x A m A n e^- lA "V^I^ lA ' f (A fc - \ m - X n - p h ) 



where 



IJ(A)=/" e - dt= ifA ^°' (B.1) 



t=a 



b-a, if A = 0. 



Note that for a non-diagonalizable matrix, a similar eigen-decomposition can be employed using 
generalized eigenvectors and the Jordan normal form, and similar, though more involved, explicit 
computations can be performed. 
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C Probabilities in discretized HMM 



We now give more explicit forms of the quantities involved in the probabilities of the discretized 
HMM, derived using the eigen-decomposition of the extended migration matrix Z = VAV^ 1 . 
Inserting equation (|4.3j) and (|4.ip into the second to last line in equation (|4.5p . combined with the 
eigen-decomposition (Ze tz )ij = Yl1=i Afce* Afc ( v k w k)i,j yields 

1 29 

y p (uj£-i,i) = — — - V,(^^fe)a, 0w A jfc I^_ 1 (A jt - p), 
v ' ; k=i 

for the first term in the last line of equation (I4.5p . with I„(A) defined as in equation (IB.ip . For the 
second term we get 

2g 2g 2g 

v k w k)a,'y{Vm w m)'y,a UJ (^nf^n^a^/ 

^ ' ' 7Gr k=l m=l n=l 

< X - r v -' ' ' A 

_ „A m Xi \ n Xj-lT X i 



Finally, using equation (|4,6p one can show that 

1 \E\ 2g 

I* (b|w,t,/0 = —, 7T^^(qjVj)h[l]m( V k W k)a,a^ktc\_A^k + 0^ ~ 0) 

U ^^> j=l k= l 

holds, where E is the set of alleles at the given locus, and we used the eigen-decompositions of 
Z and the mutation matrix P = Qdiag( / ui, . . . , H\e\) Q . Here fij are the eigenvalues of the 
mutation matrix, Q = (qi, . . . , q2 g ) is the matrix which has the eigenvectors of the mutation matrix 
as columns, and pj denotes the j-th. row of Q . 
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